HIP Compilation Failure: LLVM Linking Issues & Fix
Hey everyone,
I ran into a compilation failure with the HIP backend during the linking stage, and it seems to be related to some LLVM issues. I wanted to share my experience and see if anyone else has encountered this or has any insights.
The Problem
The build fails at this line in cudacpp.mk:
$(GPUCC) -o $@ $(BUILDDIR)/check_sa_$(GPUSUFFIX).o $(LIBFLAGS) -L$(LIBDIR) -l$(MG5AMC_GPULIB) $(gpu_objects_exe) $(BUILDDIR)/CurandRandomNumberKernel_$(GPUSUFFIX).o $(BUILDDIR)/HiprandRandomNumberKernel_$(GPUSUFFIX).o $(RNDLIBFLAGS)
The error messages are as follows:
ld.lld: error: /usr/lib/libm.so is incompatible with elf64-x86-64
ld.lld: error: /lib/libc.so.6 is incompatible with elf64-x86-64
ld.lld: error: /usr/lib/libc_nonshared.a(atexit.oS) is incompatible with elf64-x86-64
ld.lld: error: /usr/lib/libc_nonshared.a(at_quick_exit.oS) is incompatible with elf64-x86-64
ld.lld: error: /usr/lib/libc_nonshared.a(pthread_gettid_np.oS) is incompatible with elf64-x86-64
ld.lld: error: /usr/lib/libc_nonshared.a(pthread_atfork.oS) is incompatible with elf64-x86-64
ld.lld: error: /usr/lib/libc_nonshared.a(sched_getattr.oS) is incompatible with elf64-x86-64
ld.lld: error: /usr/lib/libc_nonshared.a(sched_setattr.oS) is incompatible with elf64-x86-64
ld.lld: error: /usr/lib/libc_nonshared.a(stack_chk_fail_local.oS) is incompatible with elf64-x86-64
ld.lld: error: /usr/lib/libc_nonshared.a(inet_ntop_chk.oS) is incompatible with elf64-x86-64
ld.lld: error: /usr/lib/libc_nonshared.a(inet_pton_chk.oS) is incompatible with elf64-x86-64
ld.lld: error: /lib/ld-linux.so.2 is incompatible with elf64-x86-64
When enabling verbosity with -v, the output shows:
/usr/bin/hipcc -v -o check_hip.exe ./check_sa_hip.o -L../../lib -lmg5amc_common_hip -Xlinker -rpath='$ORIGIN/../../lib' -L../../lib -lmg5amc_gg_ttx_hip ./CommonRandomNumberKernel_hip.o ./RamboSamplingKernels_hip.o ./CurandRandomNumberKernel_hip.o ./HiprandRandomNumberKernel_hip.o -L/usr/lib/ -lhiprand
AMD clang version 19.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-6.4.1 25184 c87081df219c42dc27c5b6d86c0525bc7d01f727)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-6.4.1/lib/llvm/bin
Configuration file: /opt/rocm-6.4.1/lib/llvm/bin/clang++.cfg
Found candidate GCC installation: /usr/lib/gcc/i686-redhat-linux/11
Found candidate GCC installation: /usr/lib/gcc/x86_64-redhat-linux/11
Selected GCC installation: /usr/lib/gcc/x86_64-redhat-linux/11
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64
Found CUDA installation: /usr/local/cuda-12.8, version
Found HIP installation: /opt/rocm-6.4.1/lib/llvm/bin/../../.., version 6.4.43483
"/opt/rocm-6.4.1/lib/llvm/bin/ld.lld" --hash-style=gnu --eh-frame-hdr -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o check_hip.exe /usr/lib/gcc/x86_64-redhat-linux/11/../../../../lib64/crt1.o /usr/lib/gcc/x86_64-redhat-linux/11/../../../../lib64/crti.o /opt/rocm-6.4.1/lib/llvm/lib/clang/19/lib/linux/clang_rt.crtbegin-x86_64.o -L../../lib -L../../lib -L/usr/lib/ -L/usr/lib/gcc/x86_64-redhat-linux/11 -L/usr/lib/gcc/x86_64-redhat-linux/11/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 -L/lib -L/usr/lib --enable-new-dtags ./check_sa_hip.o -lmg5amc_common_hip "-rpath=$ORIGIN/../../lib" -lmg5amc_gg_ttx_hip ./CommonRandomNumberKernel_hip.o ./RamboSamplingKernels_hip.o ./CurandRandomNumberKernel_hip.o ./HiprandRandomNumberKernel_hip.o -lhiprand -L/opt/rocm-6.4.1/lib/llvm/bin/../../../lib -rpath /opt/rocm-6.4.1/lib/llvm/bin/../../../lib -lamdhip64 -lstdc++ -lm /opt/rocm-6.4.1/lib/llvm/lib/clang/19/lib/linux/libclang_rt.builtins-x86_64.a -lgcc_s -lc /opt/rocm-6.4.1/lib/llvm/lib/clang/19/lib/linux/libclang_rt.builtins-x86_64.a -lgcc_s /opt/rocm-6.4.1/lib/llvm/lib/clang/19/lib/linux/clang_rt.crtend-x86_64.o /usr/lib/gcc/x86_64-redhat-linux/11/../../../../lib64/crtn.o
Root Cause Analysis
After doing some digging, I found a potentially related thread on the LLVM bugs page. The discussion there suggests that the presence of the -L/usr/lib/ flag can sometimes confuse the linker on certain systems. It appears that the linker is picking up libraries built for the wrong architecture. Specifically, the errors say the listed files are incompatible with elf64-x86-64, which is the 64-bit target of this build, so the files being pulled in are most likely 32-bit ones. The last error supports this: /lib/ld-linux.so.2 is the 32-bit dynamic loader, while the link line asks for /lib64/ld-linux-x86-64.so.2. Mixing 32-bit libraries into a 64-bit link is obviously a no-go.
The Potential Culprit: Conflicting Library Paths
The core issue seems to be the linker getting confused by the presence of -L/usr/lib/. This flag tells the linker to search for libraries in /usr/lib/, but on some systems this directory contains libraries for a different architecture than the one being built (on Red Hat-style multilib layouts, for example, 32-bit libraries typically live in /usr/lib and 64-bit ones in /usr/lib64). In the verbose output above, -L/usr/lib/ comes before the lib64 directories, so -lm and -lc can resolve to the 32-bit copies first, leading to the "incompatible with elf64-x86-64" errors.
Why -L/usr/lib/ Can Be Problematic
The -L flag adds a directory to the linker's library search path, and the directories are searched in the order they are given. While this is generally helpful, it can cause issues if a directory contains libraries that aren't compatible with the target architecture. In this case, /usr/lib/ appears to hold 32-bit libraries, and because it is listed early, the linker finds those before it reaches the 64-bit ones. This is especially likely on systems where both 32-bit and 64-bit versions of the same library are installed, or where there are conflicting library versions in different locations.
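To make the search-order point concrete, here's a rough sketch of how a -l<name> lookup walks the -L directories. It's a simplification, not what ld.lld actually runs, and first_match is just a name I made up for illustration:

```shell
# Mimic the linker's -l<name> lookup: walk the directories in the order
# given and print the first lib<name>.so or lib<name>.a that exists.
# first_match is a hypothetical helper, not part of any toolchain.
first_match() {
  local name="$1" dir ext
  shift
  for dir in "$@"; do
    for ext in so a; do
      # ${dir%/} drops a trailing slash so "/usr/lib/" and "/usr/lib" behave alike.
      if [ -e "${dir%/}/lib${name}.${ext}" ]; then
        printf '%s\n' "${dir%/}/lib${name}.${ext}"
        return 0
      fi
    done
  done
  return 1
}
```

Calling first_match m with the -L directories from the verbose output, in the order they appear there, shows which libm would be picked first. As far as I understand it, GNU ld skips a candidate with the wrong architecture and keeps looking, while ld.lld reports an error on it, which would explain why the order matters so much here.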
Impact on the Build Process
This issue can manifest in various ways during the build process, but it typically shows up as linking errors. The errors usually indicate that certain libraries or object files are incompatible with the target architecture. In this specific case, the errors point to incompatibility with "elf64-x86-64," which means the files the linker found were not built for the 64-bit x86-64 target.
Debugging the Issue
Debugging these kinds of linking errors can be tricky, but here are some steps you can take:
- Enable Verbose Linking: Use the -v flag with your compiler or linker command. This prints the exact commands being executed, including the linker command and all its arguments, so you can see which libraries and paths are being used.
- Inspect the Linker Command: Look closely at the linker command for suspicious -L flags or library paths. Pay attention to the order of the flags, as the linker searches paths in the order they are specified.
- Check Library Architectures: Use the file command to check the architecture of the libraries being linked. For example, file /usr/lib/libm.so will tell you if it's a 32-bit or 64-bit library (add -L if the path is a symlink, so that file reports on the target rather than on the link).
- Simplify the Linker Command: Try simplifying the linker command by removing unnecessary flags or paths. This can help you isolate the problematic part of the command.
- Consult Documentation: Refer to the documentation for your compiler, linker, and libraries. They might have specific recommendations or troubleshooting steps for linking issues.
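The "inspect" and "check architectures" steps can be sketched with two small shell helpers. Both names (list_search_dirs and elf_class) are made up for this post, and the second one only looks at the ELF header, so treat it as a quick check rather than a replacement for file:

```shell
# Print every -L search directory from a linker command line, in order.
# Only the joined form (-Ldir) is handled, not "-L dir".
list_search_dirs() {
  local arg
  for arg in "$@"; do
    case "$arg" in
      -L?*) printf '%s\n' "${arg#-L}" ;;
    esac
  done
}

# Report whether a file is a 32-bit or 64-bit ELF object.
# Bytes 0-3 are the ELF magic (7f 45 4c 46); byte 4 (EI_CLASS) is
# 1 for 32-bit and 2 for 64-bit. Static archives and linker scripts
# are not ELF files and are reported as "not-elf".
elf_class() {
  local bytes
  bytes="$(od -An -tx1 -N5 "$1" | tr -d ' \n')"
  case "$bytes" in
    7f454c4601) echo "32-bit" ;;
    7f454c4602) echo "64-bit" ;;
    7f454c46*)  echo "unknown" ;;
    *)          echo "not-elf" ;;
  esac
}
```

For example, pasting the arguments of the ld.lld line into list_search_dirs gives the search order one directory per line, and elf_class /usr/lib/libm.so can then be compared against the copy under lib64.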
The Workaround
The good news is that removing the -L/usr/lib/ flag seems to resolve the issue. With that flag removed, the following command links successfully:
"/opt/rocm-6.4.1/lib/llvm/bin/ld.lld" --hash-style=gnu --eh-frame-hdr -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o check_hip.exe /usr/lib/gcc/x86_64-redhat-linux/11/../../../../lib64/crt1.o /usr/lib/gcc/x86_64-redhat-linux/11/../../../../lib64/crti.o /opt/rocm-6.4.1/lib/llvm/lib/clang/19/lib/linux/clang_rt.crtbegin-x86_64.o -L../../lib -L../../lib -L/usr/lib/gcc/x86_64-redhat-linux/11 -L/usr/lib/gcc/x86_64-redhat-linux/11/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 -L/lib --enable-new-dtags ./check_sa_hip.o -lmg5amc_common_hip "-rpath=$ORIGIN/../../lib" -lmg5amc_gg_ttx_hip ./CommonRandomNumberKernel_hip.o ./RamboSamplingKernels_hip.o ./CurandRandomNumberKernel_hip.o ./HiprandRandomNumberKernel_hip.o -lhiprand -L/opt/rocm-6.4.1/lib/llvm/bin/../../../lib -rpath /opt/rocm-6.4.1/lib/llvm/bin/../../../lib -lamdhip64 -lstdc++ -lm /opt/rocm-6.4.1/lib/llvm/lib/clang/19/lib/linux/libclang_rt.builtins-x86_64.a -lgcc_s -lc /opt/rocm-6.4.1/lib/llvm/lib/clang/19/lib/linux/libclang_rt.builtins-x86_64.a -lgcc_s /opt/rocm-6.4.1/lib/llvm/lib/clang/19/lib/linux/clang_rt.crtend-x86_64.o /usr/lib/gcc/x86_64-redhat-linux/11/../../../../lib64/crtn.o
Steps to Apply and Verify the Fix
- Identify the Affected Line: Locate the failing line in your cudacpp.mk file. It should be similar to the one mentioned at the beginning of this post.
- Backup the cudacpp.mk File: Before making any changes, create a backup of the file so you can revert if necessary.
- Edit the File: Open cudacpp.mk in a text editor and find the line that includes the -L/usr/lib/ flag. This is the flag that tells the linker to search for libraries in the /usr/lib/ directory.
- Remove the -L/usr/lib/ Flag: Carefully remove the -L/usr/lib/ flag from the line. Make sure you don't accidentally delete any other parts of the line.
- Save the File: Save the changes you've made to cudacpp.mk.
- Clean the Build: To ensure a clean build, you might want to remove any previously compiled object files and executables. You can often do this by running a make clean command in your build directory.
- Recompile: Run the compilation process again. This will recompile your code and link it without the -L/usr/lib/ flag.
- Verify Success: Check the output of the compilation process. If the issue is resolved, you should no longer see the "incompatible with elf64-x86-64" errors.
- Test the Executable: If the compilation succeeds, test the resulting executable to ensure it runs correctly. This is important to confirm that removing the flag didn't introduce any new issues.
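The backup and edit steps above can be sketched as one small shell function. This assumes the flag appears literally as -L/usr/lib/ in the file; if it is assembled from a variable instead, nothing will match and the change has to be made wherever that variable is set. remove_usr_lib_flag is a name I made up for this sketch:

```shell
# Remove a standalone -L/usr/lib or -L/usr/lib/ token from a makefile,
# keeping the original as <file>.bak.
# Longer paths such as -L/usr/lib/gcc/... are left alone because the
# token must be followed by whitespace or the end of the line.
remove_usr_lib_flag() {
  local makefile="$1"
  sed -E -i.bak 's#(^|[[:space:]])-L/usr/lib/?([[:space:]]|$)#\1\2#g' "$makefile"
}
```

A typical run would be grep -n -- '-L/usr/lib/' cudacpp.mk to see whether there is anything to remove, then remove_usr_lib_flag cudacpp.mk, followed by make clean and a rebuild. To revert, copy cudacpp.mk.bak back over cudacpp.mk.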
Questions and Next Steps
This leads me to a few questions:
- Has this issue been discussed before? Is this a known problem? It seems like something that might pop up on different systems.
- Is this system-dependent, or is it a more general issue? I'm running this on my machine, but I'm wondering if others have seen it too.
- Does it depend on the ROCm LLVM version? I'm using a specific version, and I'm curious if this is a regression or a long-standing issue.
- Should we consider removing the -L/usr/lib/ flags? It seems like it's only causing problems in this specific line, as the rest of the compilation process works fine without it.
I'd love to hear your thoughts and experiences on this! Let's figure out the best way to handle this going forward.
Further Investigation and Long-Term Solutions
While removing the -L/usr/lib/ flag provides an immediate workaround, it's essential to understand the underlying cause and implement a more robust long-term solution. Here are some avenues for further investigation:
- System Configuration Analysis: Investigate the system's library paths and configurations. Ensure that the correct library paths are set and that there are no conflicting library versions.
- Compiler and Linker Flags: Review the compiler and linker flags used in the build process. Identify any flags that might be contributing to the issue, such as -L flags that specify incorrect library paths.
- ROCm Version Compatibility: Check the compatibility of the ROCm version with the system's operating system and other software components. Ensure that all components are compatible and up-to-date.
- LLVM Bug Reports: Monitor the LLVM bug tracker for any related issues. If the issue persists, consider filing a bug report with detailed information about the problem and steps to reproduce it.
- Community Discussions: Engage with the open-source community through forums, mailing lists, and other channels. Share your findings and seek input from other developers who might have encountered similar issues.
By thoroughly investigating the problem and collaborating with the community, we can develop more effective solutions and prevent similar issues from occurring in the future.
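One possible direction for a longer-term fix, purely as a sketch: only emit a -L flag when the directory is not one the compiler driver already searches on its own (the verbose output shows it adding the lib64 and lib directories itself). The function name lib_flag_for and the list of directories below are my assumptions, not something taken from cudacpp.mk:

```shell
# Print -L<dir> unless <dir> is a default system library directory.
# The list of defaults here is an assumption and may need adjusting
# for other distributions.
lib_flag_for() {
  local dir="${1%/}"   # drop a trailing slash before comparing
  case "$dir" in
    ""|/lib|/lib64|/usr/lib|/usr/lib64) ;;   # driver already searches these
    *) printf '%s\n' "-L$dir" ;;
  esac
}
```

With this, lib_flag_for /usr/lib/ prints nothing, while a dedicated install location such as /opt/rocm-6.4.1/lib still gets its -L flag. Whether a check like this belongs in the makefile is exactly the kind of thing I'd like feedback on.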
Conclusion
Guys, compilation failures can be a real headache, but by sharing our experiences and working together, we can often find solutions. In this case, the HIP backend compilation failure due to LLVM issues was traced to a conflicting library path. Removing the -L/usr/lib/ flag provided a workaround, but further investigation is needed for a long-term solution. I hope this post helps anyone else facing similar problems, and I look forward to hearing your thoughts and suggestions!